
fix(registration): jitter cooldown exit and rate-limit registration retries#860

Merged
andrewazores merged 4 commits into cryostatio:main from andrewazores:registration-herd
May 4, 2026

Conversation

@andrewazores
Member

Based on #858
Depends on #858
See #851

Adds two more behaviours:

  1. adds jitter to the cooldown time so that if multiple Agent instances enter failure cooldown around the same time, they don't all exit cooldown at the same moment and flood the Cryostat server. This can happen if the Cryostat server itself has failed, for example.
  2. adds a retry rate limit on registration so that the Agent will not re-attempt registration too rapidly, even if it has been pinged by the Cryostat server asking it to refresh its registration.
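The two behaviours above can be sketched roughly as follows. This is a minimal illustration, not the actual Cryostat Agent code; the class and parameter names (`RegistrationBackoff`, `jitterFactor`, `minRetryInterval`) are hypothetical:

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of jittered cooldown plus a minimum retry interval.
public class RegistrationBackoff {
    private final Duration baseCooldown;
    private final double jitterFactor; // e.g. 0.25 => +/-25% around the base
    private final Duration minRetryInterval;
    private long lastAttemptNanos = Long.MIN_VALUE;

    public RegistrationBackoff(
            Duration baseCooldown, double jitterFactor, Duration minRetryInterval) {
        this.baseCooldown = baseCooldown;
        this.jitterFactor = jitterFactor;
        this.minRetryInterval = minRetryInterval;
    }

    /** Cooldown with random jitter so instances don't all exit cooldown in lockstep. */
    public Duration jitteredCooldown() {
        double scale = 1.0 + ThreadLocalRandom.current().nextDouble(-jitterFactor, jitterFactor);
        return Duration.ofNanos((long) (baseCooldown.toNanos() * scale));
    }

    /**
     * Rate limit: returns false (skip the attempt) if the previous registration
     * attempt was less than minRetryInterval ago, even if the server pinged us.
     */
    public synchronized boolean tryAcquire(long nowNanos) {
        if (lastAttemptNanos != Long.MIN_VALUE
                && nowNanos - lastAttemptNanos < minRetryInterval.toNanos()) {
            return false;
        }
        lastAttemptNanos = nowNanos;
        return true;
    }

    public static void main(String[] args) {
        RegistrationBackoff backoff =
                new RegistrationBackoff(Duration.ofSeconds(30), 0.25, Duration.ofSeconds(10));
        System.out.println("next cooldown: " + backoff.jitteredCooldown());
    }
}
```

With a 30s base and 25% jitter, each instance draws a cooldown somewhere in [22.5s, 37.5s), which spreads the herd's re-registration attempts out over a window instead of a single instant.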

@andrewazores andrewazores force-pushed the registration-herd branch 2 times, most recently from 7293cdf to f3c4af4 Compare May 1, 2026 14:07
@andrewazores andrewazores marked this pull request as ready for review May 1, 2026 15:17
@andrewazores andrewazores requested a review from a team May 1, 2026 15:17
Member

@jtolentino1 jtolentino1 left a comment


LGTM from my testing.

I tested the newer integrated images (cryostat-agent-init:registration-herd-6 and cryostat:4.2.0-registration-herd-5) on OpenShift with 22 injected Agent replicas. The 30+ minute soak stayed stable with 22 ready pods and 22 Cryostat targets, and aliases/connectUrls matched the live Agent pods. Scaling 22 -> 12 -> 22 also converged and Cryostat tracked the instances correctly.

I also tested the registration behavior directly. Repeated refresh pings returned 204, but the Agent logged the minimum-interval skips instead of rapidly re-registering, and the credential id stayed unchanged. After killing Cryostat with `kill 1`, several Agents entered cooldown with different jittered durations around the 30s base, and the system recovered back to 22/22 targets after about 3 minutes.

For the Cryostat-side changes, I saw the restart path using periodic discovery jobs with no old discovery.startup jobs left, and the new fault-tolerance rate limits fired for the registration/credential paths during recovery.

One note: after restart/recovery I did see stale discovery.periodic Quartz jobs logging Plugin not found, and the DB had more periodic jobs/credentials than live plugins, but the visible target state recovered correctly.

@andrewazores
Member Author

Thanks for the detailed analysis @jtolentino1 !

> One note: after restart/recovery I did see stale discovery.periodic Quartz jobs logging Plugin not found, and the DB had more periodic jobs/credentials than live plugins, but the visible target state recovered correctly.

This is "expected" in the current server-side implementation - when the job next runs, it'll cancel its own trigger if it detects that the Target it's set up for has disappeared. After a few minutes the persisted periodic jobs state in the database should settle back to 1:1 with the discovered targets once the system has made a full recovery.

https://github.com/cryostatio/cryostat/blob/8fced699fe4def3aa5fbd95fdc16ce18dabc2789/src/main/java/io/cryostat/discovery/Discovery.java#L974
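The self-cancelling behaviour described above could look something like the sketch below. The real server uses Quartz jobs; this illustration uses a plain ScheduledExecutorService instead, and all names (`SelfCancellingJob`, `liveTargets`, `targetId`) are hypothetical:

```java
import java.util.Set;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a periodic job that cancels its own trigger
// once it observes that the target it was created for has disappeared.
public class SelfCancellingJob {
    public static ScheduledFuture<?> schedule(
            ScheduledExecutorService scheduler,
            Set<String> liveTargets,
            String targetId,
            Runnable work,
            long periodMillis) {
        ScheduledFuture<?>[] handle = new ScheduledFuture<?>[1];
        handle[0] =
                scheduler.scheduleAtFixedRate(
                        () -> {
                            if (!liveTargets.contains(targetId)) {
                                // Target gone: cancel our own trigger so the stale
                                // job stops firing. Persisted job state settles back
                                // to 1:1 with the live targets over the next ticks.
                                handle[0].cancel(false);
                                return;
                            }
                            work.run();
                        },
                        periodMillis,
                        periodMillis,
                        TimeUnit.MILLISECONDS);
        return handle[0];
    }
}
```

This matches the recovery pattern reported in testing: stale jobs log a miss once, cancel themselves, and the database state converges a few minutes after the targets do.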

Credentials should also eventually settle back to 1:1, but it's not critical if there are stale Credentials left around with 0 matching targets. If that is the case then that's another bug we should fix, but I think that can wait.

@andrewazores andrewazores merged commit 7f6af28 into cryostatio:main May 4, 2026
9 checks passed
@andrewazores andrewazores deleted the registration-herd branch May 4, 2026 20:30
